Requirements

In [1]:
!pip install \
    numpy \
    pandas \
    xlrd \
    scipy \
    plotly \
    > /dev/null
In [2]:
import pandas as pd
import plotly.graph_objects as go
import numpy as np
from scipy.optimize import minimize
from scipy.special import beta
from scipy.stats import gaussian_kde
import plotly.figure_factory as ff
import scipy.stats as st
In [3]:
# load data
data = pd.read_csv('telco.txt', delimiter='\t')

data.head()
Out[3]:
tenure age marital address income ed employ retire gender longmon wiremon churn
1 13 44 Married 9 64 College degree 5 No Male 3.70 0.0 Yes
2 11 33 Married 7 136 Post-undergraduate degree 5 No Male 4.40 35.7 Yes
3 68 52 Married 24 116 Did not complete high school 29 No Female 18.15 0.0 No
4 33 33 Unmarried 12 33 High school degree 0 No Female 9.45 0.0 Yes
5 23 30 Married 9 30 Did not complete high school 2 No Male 6.30 0.0 No

Problem 3

1

(a)

From the plots we can conclude that the law of large numbers holds: the bigger the sample we take, the closer the estimated mean and variance get to the true values.

In [4]:
np.random.seed(0)  # fix the seed so the results are reproducible

samples = np.arange(100, 100_000, 1000)
sample_distr = [np.random.normal(loc=1, scale=1, size=x) for x in samples]
means = [np.mean(x) for x in sample_distr]
variances = [np.var(x) for x in sample_distr]
In [5]:
mean_plot = go.Scatter(
    x=samples,
    y=means,
    mode='markers',
    name='sample of mean',
)
fig = go.Figure(data=[mean_plot])
fig.update_layout(
    title="Sample mean vs sample size",
    xaxis_title="sample size",
    yaxis_title="sample mean",
)
fig.show()

(b)

source

For a population with unknown mean and known standard deviation $\sigma$, a confidence interval for the population mean, based on a simple random sample (SRS) of size $n$, is $\bar{X} \pm z^* \frac{\sigma}{\sqrt{n}}$, where $z^*$ is the upper $\frac{1 - C}{2}$ critical value of the standard normal distribution.

For a $95\%$ confidence interval, $z^*$ is $1.96$; with this we can compute a confidence interval for every sample mean.
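Rather than hard-coding $1.96$, the critical value can be taken from SciPy's normal distribution (a small sketch using `scipy.stats`, imported above as `st`):

```python
import scipy.stats as st

# critical value z* for a two-sided C = 95% interval:
# the upper (1 - C) / 2 = 0.025 tail boundary of N(0, 1)
C = 0.95
z_star = st.norm.ppf(1 - (1 - C) / 2)
print(round(z_star, 2))  # 1.96
```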

The first plot shows confidence intervals built from the known $\sigma = 1$.
The second plot uses $\sigma$ estimated from the sample.

Conclusions:
The confidence intervals support the statement, following from the law of large numbers, that a bigger sample size leads to better estimates of the distribution parameters.

With the estimated $\sigma$ the results show no significant difference; only for smaller sample sizes do the intervals differ slightly from those built with the real $\sigma$. From this we can conclude that for large samples the estimated $\sigma$ can be used in the confidence intervals.
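As a quick numeric check (a sketch under the same $N(1, 1)$ setup as above; the seed and sample sizes are arbitrary choices), the half-width of the interval built from the sample $\sigma$ approaches the half-width built from the known $\sigma = 1$:

```python
import numpy as np

# arbitrary seed; same N(1, 1) population as above
rng = np.random.default_rng(0)
for n in (100, 10_000):
    x = rng.normal(loc=1, scale=1, size=n)
    known = 1.96 * 1 / np.sqrt(n)                      # half-width, true sigma = 1
    estimated = 1.96 * np.std(x, ddof=1) / np.sqrt(n)  # half-width, sample sigma
    print(n, known, estimated)
```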

In [6]:
upper_cofindeces = [np.mean(x) + 1.96 * 1 / np.sqrt(len(x)) for x in sample_distr]
lower_cofindeces = [np.mean(x) - 1.96 * 1 / np.sqrt(len(x)) for x in sample_distr]
In [7]:
upper_confidence_plot = go.Scatter(
    x=samples,
    y=upper_cofindeces,
    mode='lines',
    name='95% confidence interval',
)
lower_confidence_plot = go.Scatter(
    x=samples,
    y=lower_cofindeces,
    mode='lines',
    name='95% confidence interval',
)
fig = go.Figure(data=[mean_plot, upper_confidence_plot, lower_confidence_plot])
fig.update_layout(
    title="Sample mean vs sample size (Known sigma)",
    xaxis_title="sample size",
    yaxis_title="sample mean",
)
fig.show()
In [8]:
# the estimated sigma is the sample standard deviation, not the variance
upper_cofindeces = [np.mean(x) + 1.96 * np.std(x) / np.sqrt(len(x)) for x in sample_distr]
lower_cofindeces = [np.mean(x) - 1.96 * np.std(x) / np.sqrt(len(x)) for x in sample_distr]
In [9]:
upper_confidence_plot = go.Scatter(
    x=samples,
    y=upper_cofindeces,
    mode='lines',
    name='95% confidence interval',
)
lower_confidence_plot = go.Scatter(
    x=samples,
    y=lower_cofindeces,
    mode='lines',
    name='95% confidence interval',
)
fig = go.Figure(data=[mean_plot, upper_confidence_plot, lower_confidence_plot])
fig.update_layout(
    title="Sample mean vs sample size (Estimated sigma)",
    xaxis_title="sample size",
    yaxis_title="sample mean",
)
fig.show()

(d)

From the plot we can conclude that with a bigger sample size the estimated variance gets closer to the real one. Compared to the estimated mean, the variance converges more slowly, which may be because it is a second-moment estimate (an average of squared deviations).

Consistency means that the more data points we take, the closer the estimate gets to the real value. Based on the plot, the estimated variance converges to the real one, so the sample variance is a consistent estimator.
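A minimal numeric sketch of this consistency (the seed and sample sizes are arbitrary choices): for $N(1, 1)$ the true variance is $1$, and the absolute error of the sample variance shrinks roughly like $\sqrt{2/n}$:

```python
import numpy as np

# arbitrary seed and sample sizes; true variance of N(1, 1) is 1
rng = np.random.default_rng(0)
for n in (100, 100_000):
    x = rng.normal(loc=1, scale=1, size=n)
    err = abs(np.var(x, ddof=1) - 1.0)  # distance from the true variance
    print(n, err)
```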

In [10]:
variance_plot = go.Scatter(
    x=samples,
    y=variances,
    mode='markers',
    name='sample of mean',
)

fig = go.Figure(data=[variance_plot])
fig.update_layout(
    title="Sample variance vs sample size",
    xaxis_title="sample size",
    yaxis_title="sample variance",
)
fig.show()

2

(a)

$f(x) = \frac{\left(1 + \frac{x^2}{df}\right)^{-\frac{df + 1}{2}}}{B(\frac{df}{2}, \frac{1}{2})\sqrt{df}}$

Likelihood function would be:

$L_v(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n f(x_i) = \prod_{i=1}^n \frac{\left(1 + \frac{x_i^2}{df}\right)^{-\frac{df + 1}{2}}}{B(\frac{df}{2}, \frac{1}{2})\sqrt{df}}$

Since our aim is maximization of the likelihood, we can equivalently maximize the $\log$ of the likelihood, which simplifies the equation (replaces the product with a sum).

$\log\left(\prod_{i=1}^n f(x_i)\right) = \sum_{i=1}^n \log\left(\frac{\left(1 + \frac{x_i^2}{df}\right)^{-\frac{df + 1}{2}}}{B(\frac{df}{2}, \frac{1}{2})\sqrt{df}}\right)$

After simplification we get:

$L_v(x_1, x_2, \ldots, x_n) = -\sum_{i=1}^n \left[ \frac{df + 1}{2} \log\left(1 + \frac{x_i^2}{df}\right) + \frac{\log(df)}{2} + \log\left(B\left(\frac{df}{2}, \frac{1}{2}\right)\right) \right]$
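As a sanity check, the derived log-likelihood can be compared with SciPy's reference t log-density `st.t.logpdf` (a self-contained sketch; `log_likelihood` mirrors the expression above):

```python
import numpy as np
import scipy.stats as st
from scipy.special import beta

def log_likelihood(x, df):
    # minus the sum of the three terms derived above
    return -np.sum(
        (df + 1) / 2 * np.log(1 + x ** 2 / df)
        + np.log(df) / 2
        + np.log(beta(df / 2, 1 / 2))
    )

x = np.linspace(-3, 3, 7)
print(np.allclose(log_likelihood(x, 5), st.t.logpdf(x, 5).sum()))  # True
```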

(b)

In [11]:
def likelihood(x, df):
    return -np.sum(
        (df + 1) / 2 * np.log(1 + x ** 2 / df) + 
        np.log(df) / 2 + np.log(beta(df / 2, 1 / 2))
    )
In [12]:
np.random.seed(3)
x = np.random.standard_t(5, 100)
minimized = minimize(lambda df: -likelihood(x, df), x0=1, tol = 0.001)
print(minimized)
      fun: 154.7873583895098
 hess_inv: array([[11.82475394]])
      jac: array([-4.57763672e-05])
  message: 'Optimization terminated successfully.'
     nfev: 36
      nit: 11
     njev: 12
   status: 0
  success: True
        x: array([7.29550962])

Note: since optimizers work in terms of minimization, we take the negative log-likelihood and minimize it.
The minimization is not always stable, but on average it gives values close to the real one; occasionally it ends up at very large values of $df$.
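To illustrate the stability point, a sketch (assumed setup: 20 independent samples of size 1000 from a t distribution with $df = 5$, fitted with bounded L-BFGS-B to keep $df$ positive) that looks at the spread of the estimates:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import beta

def log_likelihood(x, df):
    return -np.sum(
        (df + 1) / 2 * np.log(1 + x ** 2 / df)
        + np.log(df) / 2
        + np.log(beta(df / 2, 1 / 2))
    )

rng = np.random.default_rng(0)
estimates = []
for _ in range(20):
    x = rng.standard_t(5, 1000)
    # bounded method keeps df positive so np.log(df) stays defined
    res = minimize(lambda df: -log_likelihood(x, df), x0=np.array([2.0]),
                   method='L-BFGS-B', bounds=[(0.1, 100.0)])
    estimates.append(res.x[0])

print(np.median(estimates))  # should land near the true df = 5
```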

In [13]:
np.random.seed(3)
x = np.random.standard_t(5, 5000)
minimized = minimize(lambda df: -likelihood(x, df), x0=1, tol = 0.001)
print(minimized)
      fun: 8067.249491738491
 hess_inv: array([[0.07566652]])
      jac: array([-0.00048828])
  message: 'Optimization terminated successfully.'
     nfev: 33
      nit: 10
     njev: 11
   status: 0
  success: True
        x: array([5.21827166])

From the results we can conclude that the bigger the sample, the better the degrees-of-freedom parameter is estimated. The maximum-likelihood estimator is a consistent estimator.

3

(a)

In [14]:
np.random.seed(0)

distr = np.random.chisquare(2, (1000, 100))
means = np.mean(distr, axis=1)
vars = np.var(distr, axis=1)
In [15]:
sample_distr = go.Histogram(
    x=distr[0],
    name='sample chi^2 distribution',
)
fig = go.Figure(data=[sample_distr])
fig.update_layout(
    title="Sample chi^2 distribution",
)
fig.show()

(b)

In [16]:
mean_plot = go.Histogram(
    x=means,
    name='sample of mean',
    opacity=0.8,

)
var_plot = go.Histogram(
    x=vars,
    name='sample of variances',
    opacity=0.8,

)
fig = ff.create_distplot([means, vars], ['mean', 'var'], bin_size=[0.01, 0.01])
fig.show()

From the plot we can see that the sample means and sample variances form approximately normal distributions. The more samples are taken, the closer the distribution gets to normal. This follows from the CLT, which states that the mean of samples from (not only normal) distributions is asymptotically normally distributed; the same applies to the variances.

(c)

The CLT applies to variances because the sample variance is itself an average, namely the mean of the squared deviations from the sample mean:

$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$
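Concretely, `np.var` is exactly this mean of squared deviations, so the CLT for sample means carries over to it (a small sketch with an arbitrary seed):

```python
import numpy as np

# arbitrary seed; one chi^2(2) sample of size 100
rng = np.random.default_rng(0)
x = rng.chisquare(2, 100)

# np.var (ddof=0) is exactly the mean of the squared deviations,
# i.e. a sample mean of the values (x_i - x_bar)^2
sq_dev = (x - x.mean()) ** 2
print(np.allclose(np.var(x), sq_dev.mean()))  # True
```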

(d)

In [17]:
np.random.seed(0)

for N in (1e3, 1e4, 1e5,):
    distr = np.random.chisquare(2, (1000, int(N)))
    means = np.mean(distr, axis=1)
    vars = np.var(distr, axis=1)

    fig = ff.create_distplot([means, vars], ['mean', 'vars'], bin_size=[10 / N, 100 / N])
    fig.update_layout(
        title=f"Mean/Var (N = {N})",
    )
    fig.show()

4

(a)

In [18]:
np.random.seed(2)
N = 100
distr = np.random.normal(loc=500, scale=np.sqrt(50), size=N)
fig = ff.create_distplot([distr], ['distribution'], bin_size=[2])
fig.update_layout(
    title="loc=500, var=50",
)
fig.show()

We do not know the standard deviation of the underlying population. Therefore, we need to use the sample standard deviation $s$ and switch to the Student's t distribution with $N-1$ degrees of freedom:
$t_{n-1} = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$

In [19]:
t_st = (np.mean(distr) - 500) / np.std(distr) * np.sqrt(N)
m, std = np.mean(distr), np.std(distr)
print(f'Estimated mean: {m}, Estimated STD: {std}')
print(f'T-st: {t_st}')
print(f'Rejection area: (-∞, {st.t.ppf(0.02, N-1)}) and ({st.t.ppf(0.98, N-1)}, +∞)')
Estimated mean: 499.2664394074363, Estimated STD: 7.332950539872534
T-st: -1.0003621169610002
Rejection area: (-∞, -2.081161540890138) and (2.081161540890138, +∞)

The rejection area is $(-\infty, -2.0812)\text{ and }(2.0812, +\infty)$. Since $t_{test} = -1.00036$ does not fall into it, we cannot reject the $H_0$ hypothesis.

(b)

In [20]:
calculated_p = (1 - st.t.cdf(np.abs(t_st), N-1)) * 2
calc_t_st, build_in_p = st.ttest_1samp(distr, 500)
print(f'T-st: {calc_t_st}')
print(f'Calculated: {calculated_p}, Numpy: {build_in_p}')
T-st: -0.9953477389335783
Calculated: 0.319574143656018, Numpy: 0.32199378390116346

The results of the manually calculated and the built-in t-statistic and p-value are very similar; the small difference in the t-statistic comes from `np.std` using `ddof=0` by default, while `ttest_1samp` uses the sample standard deviation with `ddof=1`. The p-value shows how probable it is to get test results at least as extreme as those actually observed. If the p-value is less than $\alpha$, we reject $H_0$.
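The residual gap in the t-statistic disappears once the same `ddof` is used on both sides (a sketch reproducing the setup with an arbitrary seed):

```python
import numpy as np
import scipy.stats as st

# arbitrary seed; same N(500, 50) setup as above
rng = np.random.default_rng(2)
x = rng.normal(loc=500, scale=np.sqrt(50), size=100)

# with ddof=1 (the sample standard deviation) the manual statistic
# matches scipy.stats.ttest_1samp exactly
t_manual = (x.mean() - 500) / (x.std(ddof=1) / np.sqrt(len(x)))
t_scipy, p_scipy = st.ttest_1samp(x, 500)
print(np.allclose(t_manual, t_scipy))  # True
```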

(c)

In [21]:
np.random.seed(0)
M = 1000
N = 100
sigma = np.sqrt(50)
mean = 500
alpha = 0.04

for i in range(1, 11):
    buf = []
    for _ in range(M):
        t, p = st.ttest_1samp(np.random.normal(mean, sigma, N), mean)
        buf.append(p < alpha)

    alpha_h = np.sum(buf) / len(buf)

    print(f'{i}: Empirical confidence level:', alpha_h)
1: Empirical confidence level: 0.051
2: Empirical confidence level: 0.032
3: Empirical confidence level: 0.033
4: Empirical confidence level: 0.037
5: Empirical confidence level: 0.034
6: Empirical confidence level: 0.036
7: Empirical confidence level: 0.049
8: Empirical confidence level: 0.044
9: Empirical confidence level: 0.038
10: Empirical confidence level: 0.033

Multiple runs show that the empirical rejection rate is close to $\alpha = 0.04$. This is expected rather than strange: when $H_0$ is true, the test rejects with probability $\alpha$ by construction, so the observed rate estimates the type I error rate, not $0$.

(d)

In [22]:
np.random.seed(0)
M = 1000
N = 100
sigma = np.sqrt(50)
mean = 500
alpha = 0.04

for i in range(1, 11):
    buf = []
    for _ in range(M):
        distr = np.random.standard_t(3, N)
        t, p = st.ttest_1samp(mean + distr / np.sqrt(np.abs(distr)), mean)
        buf.append(p < alpha)

    alpha_h = np.sum(buf) / len(buf)

    print(f'{i}: Empirical confidence level:', alpha_h)
1: Empirical confidence level: 0.048
2: Empirical confidence level: 0.033
3: Empirical confidence level: 0.049
4: Empirical confidence level: 0.036
5: Empirical confidence level: 0.041
6: Empirical confidence level: 0.037
7: Empirical confidence level: 0.05
8: Empirical confidence level: 0.039
9: Empirical confidence level: 0.047
10: Empirical confidence level: 0.045

(e)

In [23]:
from collections import Counter

np.random.seed(0)
M = 1000
N = 100
sigma = np.sqrt(50)
mean = 500
alpha = 0.05

buff = []
for df in range(2, 51):
    for _ in range(M):
        distr = np.random.standard_t(df, N)
        d, p = st.kstest(distr, 'norm')
        if p < alpha:
            buff.append(df)
In [24]:
data = Counter(buff)
fig = go.Figure(
    data=[
        go.Bar( x=list(data.keys()), y=list(data.values()),   )
    ]
)
fig.update_layout(
    title="H0 rejected per degree of freedom",
    xaxis_title="degree of freedom",
    yaxis_title="quantity of rejections",
)
fig.show()

From the plot we can see that, as the degrees of freedom of the t distribution increase, the number of rejected $H_0$ hypotheses decreases. This is because with more degrees of freedom the t distribution approximates the normal distribution better and better.
After about $4$-$5$ degrees of freedom, the number of rejections is almost constant at about $50$ ($5\%$ of all tests). This matches $\alpha = 0.05$: once the t distribution is practically indistinguishable from the normal, the remaining rejections are just the type I errors of the test.
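The convergence of the t distribution to the normal can also be seen directly in its quantiles (a sketch comparing upper $97.5\%$ critical values):

```python
import scipy.stats as st

z = st.norm.ppf(0.975)  # normal critical value, ~1.96
for df in (3, 10, 30, 100):
    t_crit = st.t.ppf(0.975, df)
    # the gap to the normal quantile shrinks as df grows
    print(df, round(t_crit, 3), round(t_crit - z, 3))
```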
